To process loan applications in a more efficient manner and understand factors of good credit vs. bad credit, we apply various machine learning algorithms to a dataset that has been manually evaluated by expert decision-makers. Through this process, we are able to build a predictive model to assess the probability an application will be good or bad. We then use the expected value of accepting good or bad applications to guide a recommended process for auto-accepting, auto-rejecting, or manually reviewing applications. Our process allows us to review applications at the extreme ends of the spectrum quicker, and then use the outcomes of a manual review process to further improve our model. We are able to achieve a better customer experience and reduced cost to review applications.
Business Background
German Credit is a money-lending organization that provides loans to its customers. Humans review the applications and provide a recommendation for credit using information they have collected from the customer. Upon review, the decision maker evaluates if the customer has “good credit” or “bad credit” and uses that determination to accept or reject the application.
Business Problem & Analytics Solution
The current process for German Credit to review applications and respond to the customer takes a lot of time. The decision maker needs to acquire any relevant data from its database on the customer (if they are a current customer of the instituion), review the status of the application, and then get together as a panel to decide whether to move forward with the application.
The current process takes a week to complete, but German Credit has received feedback from customers and internal research that speed of approval is very important to the customer satisfaction level as customers would like to be able to action on their use of the loan quicker. German Credit believes that being able to decrease the turnaround time (TAT) of loan applications will increase the retention rate of its loan services.
There are two methods being explored to increase the TAT of the loan application process. First, the decision makers have agreed upon a set of variables that inform their decision process. A data pipeline has been created to provide them with access to relevant data on the application without the need for them to track it down from various places. Second, we will explore using machine learning to model their decision-making criteria. The output of this model will be used to determine current factors decision makers rely on the most when making their decisions and to score future applications in an effort to “auto-accept” customers with good credit.
To assist in the development of the decision support model, a panel of decision makers have scored a sample of 1000 applications with “good credit” or “bad credit” using 30 variables from the customer’s application and any current dealings with German Credit.
Lastly, German Credit have provided the cost/benefit of the application process.
The goal for the analytics product will be to develop a predictive model to use in place of the current application process. We will factor in the cost/benefit of predictions to select the model that maximizes expected profit. Based on model performance, we will recommend a cut off point for triaging applications between “auto-accept”, “needs review”, and “auto-reject”.
Based on the problem at hand, our task will be to use the set of decision criteria to see if we can accurately replicate the current decision maker’s process. The goal is to operationalize their expert opinion to increase the speed of the application process, increase the throughput of applications, and reduce the costs associated with reviewing applications. Each of these business goals are believed to have a positive impact on overall customer satisfaction which should increase customer retention.
Below we will explore the dataset of applications and determine the feasibility of a solution. The variables represent applicant information such as the purpose of their credit, past dealings with the bank, various financial information, and other administrative information about the customer.
Most interesting, it appears this dataset is entirely male based on the variables described as “Applicant is male and [insert relationship stats]”. This is important to note as if this project moves forward, we will need to consdier building an additional model to account for a different population of applicants and performance may suffer.
# load data definitions provided
definitions <- read_csv("data/DataDescription.csv")
# display in table
kable(definitions, type = "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left",
latex_options = "scale_down")
| Variable Name | Description | Variable Type |
|---|---|---|
| OBS# | Observation No. | Categorical |
| CHK_ACCT | Checking account status | Categorical |
| DURATION | Duration of credit in months | Numerical |
| HISTORY | Credit history | Categorical |
| NEW_CAR | Purpose of credit | Binary |
| USED_CAR | Purpose of credit | Binary |
| FURNITURE | Purpose of credit | Binary |
| RADIO/TV | Purpose of credit | Binary |
| EDUCATION | Purpose of credit | Binary |
| RETRAINING | Purpose of credit | Binary |
| AMOUNT | Credit amount | Numerical |
| SAV_ACCT | Average balance in savings account | Categorical |
| EMPLOYMENT | Present employment since | Categorical |
| INSTALL_RATE | Installment rate as % of disposable income | Numerical |
| MALE_DIV | Applicant is male and divorced | Binary |
| MALE_SINGLE | Applicant is male and single | Binary |
| MALE_MAR_WID | Applicant is male and married or a widower | Binary |
| CO-APPLICANT | Application has a co-applicant | Binary |
| GUARANTOR | Applicant has a guarantor | Binary |
| PRESENT_RESIDENT | Present resident since - years | Categorical |
| REAL_ESTATE | Applicant owns real estate | Binary |
| PROP_UNKN_NONE | Applicant owns no property (or unknown) | Binary |
| AGE | Age in years | Numerical |
| OTHER_INSTALL | Applicant has other installment plan credit | Binary |
| RENT | Applicant rents | Binary |
| OWN_RES | Applicant owns residence | Binary |
| NUM_CREDITS | Number of existing credits at this bank | Numerical |
| JOB | Nature of job | Categorical |
| NUM_DEPENDENTS | Number of people for whom liable to provide maintenance | Numerical |
| TELEPHONE | Applicant has phone in his or her name | Binary |
| FOREIGN | Foreign worker | Binary |
| RESPONSE | Credit rating is good | Binary |
First we will pre-process our variables based on our data definitions provided. In some cases we truncate the label for plotting purposes.
# load our dataset
credit <- read_csv("data/GermanCredit.csv")
# convert categorical variables to factors with their labels
credit$CHK_ACCT <- factor(credit$CHK_ACCT,
labels = c("< 0", "0 - <200", "200+", "None"))
credit$HISTORY <- factor(credit$HISTORY,
labels = c("none", "current", "existing current",
"previous delays", "critical account"))
credit$SAV_ACCT <- factor(credit$SAV_ACCT,
labels = c("<100", "100 - <500",
"500 - <1000", "1000+",
"unknown/none"))
credit$EMPLOYMENT <- factor(credit$EMPLOYMENT,
labels = c("unemployed", "<1 year",
"1 - <4 years", "4 - <7 years",
"7+ years"))
# note the data dictionary shows 3 being >4 years, however that would invalidate this column for the case study
# in the real world, this discrepancy would need to be addressed
credit$PRESENT_RESIDENT <- factor(credit$PRESENT_RESIDENT,
labels = c("<= 1 year", ">1 - 2 years", ">2 - 3 years", ">3 years"))
credit$JOB <- factor(credit$JOB,
labels = c("none/NA", #"unemployed/unskilled - non-resident",
"low", #"unskilled - resident",
"medium", #"skilled employee / official",
"high")) #management/ self-employed/ highly qualified employee/ officer"))
credit$RESPONSE <- factor(credit$RESPONSE,
labels = c("Bad", "Good"))
credit <- credit %>%
rename(RADIO_TV = `RADIO/TV`,
CO_APPLICANT = `CO-APPLICANT`)
credit.obs <- credit$`OBS#`
credit <- subset(credit, select = -`OBS#`)
Before we partition our data into a training and test set for modeling, let’s first ensure we don’t have any holes we’ll need to account for.
# create function to run summary on numeric features
df_num_summary <- function(df, cols = NULL) {
if (is.null(cols)) {
num.cols <- colnames(select_if(df, is.numeric))
} else {
num.cols <- cols
}
df <- subset(df, select = num.cols)
df.num.summmary <- data.frame(
Count = round(sapply(df, length), 2),
Miss = round((sapply(df, function(x) sum(length(which(is.na(x)))) / length(x)) * 100), 1),
Card. = round(sapply(df, function(x) length(unique(x))), 2),
Min. = round(sapply(df, min, na.rm = TRUE), 2),
`25 perc.` = round(sapply(df, function(x) quantile(x, 0.25, na.rm = TRUE)), 2),
Median = round(sapply(df, median, na.rm = TRUE), 2),
Mean = round(sapply(df, mean, na.rm = TRUE), 2),
`75 perc.` = round(sapply(df, function(x) quantile(x, 0.75, na.rm = TRUE)), 2),
Max = round(sapply(df, max, na.rm = TRUE), 2),
`Std Dev.` = round(sapply(df, sd, na.rm = TRUE), 2)
) %>%
rename(`1st Qrt.` = X25.perc.,
`3rd Qrt.` = X75.perc.,
`Miss Pct.` = Miss)
return(df.num.summmary)
}
# create function to run summary on categorical features
df_cat_summary <- function(df, cols = NULL) {
if (is.null(cols)) {
cat.cols <- colnames(select_if(df, is.factor))
} else {
cat.cols <- cols
}
df <- subset(df, select = cat.cols)
df.cat.summary <- data.frame(
Count = round(sapply(df, length), 2),
Miss = round(sapply(df, function(x) sum(length(which(is.na(x)))) / length(x)), 2),
Card. = round(sapply(df, function(x) length(unique(x))), 2),
Mode = names(sapply(df, function(x) sort(table(x), decreasing = TRUE)[1])),
Mode_Freq = sapply(df, function(x) sort(table(x), decreasing = TRUE)[1]),
Mode_pct = round((sapply(df, function(x) sort(table(x),
decreasing = TRUE)[1] / length(x)) * 100), 1),
Mode_2 = names(sapply(df, function(x) sort(table(x), decreasing = TRUE)[2])),
Mode_Freq_2 = sapply(df, function(x) sort(table(x), decreasing = TRUE)[2]),
Mode_pct_2 = round((sapply(df, function(x) sort(table(x),
decreasing = TRUE)[2] / length(x)) * 100), 1)
)
df.cat.summary$Mode <- gsub("^.*\\.","", df.cat.summary$Mode)
df.cat.summary$Mode_2 <- gsub("^.*\\.","", df.cat.summary$Mode_2)
df.cat.summary <- df.cat.summary %>%
rename(`Miss Pct.` = Miss,
`Mode Freq.` = Mode_Freq,
`Mode Pct.` = Mode_pct,
`2nd Mode` = Mode_2,
`2nd Mode Freq.` = Mode_Freq_2,
`2nd Mode Pct.` = Mode_pct_2
)
return(df.cat.summary)
}
We do not see any missing values from our numeric variables. Our DURATION and AMOUNT variables appear to be right skewed. We may need to account for potential outliers for the AMOUNT variable.
credit.num.summary <- df_num_summary(df = credit)
# display in table
kable(credit.num.summary, type = "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left",
latex_options = "scale_down")
| Count | Miss Pct. | Card. | Min. | 1st Qrt. | Median | Mean | 3rd Qrt. | Max | Std.Dev. | |
|---|---|---|---|---|---|---|---|---|---|---|
| DURATION | 1000 | 0 | 33 | 4 | 12.0 | 18.0 | 20.90 | 24.00 | 72 | 12.06 |
| NEW_CAR | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.23 | 0.00 | 1 | 0.42 |
| USED_CAR | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.10 | 0.00 | 1 | 0.30 |
| FURNITURE | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.18 | 0.00 | 1 | 0.39 |
| RADIO_TV | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.28 | 1.00 | 1 | 0.45 |
| EDUCATION | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.05 | 0.00 | 1 | 0.22 |
| RETRAINING | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.10 | 0.00 | 1 | 0.30 |
| AMOUNT | 1000 | 0 | 921 | 250 | 1365.5 | 2319.5 | 3271.26 | 3972.25 | 18424 | 2822.74 |
| INSTALL_RATE | 1000 | 0 | 4 | 1 | 2.0 | 3.0 | 2.97 | 4.00 | 4 | 1.12 |
| MALE_DIV | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.05 | 0.00 | 1 | 0.22 |
| MALE_SINGLE | 1000 | 0 | 2 | 0 | 0.0 | 1.0 | 0.55 | 1.00 | 1 | 0.50 |
| MALE_MAR_or_WID | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.09 | 0.00 | 1 | 0.29 |
| CO_APPLICANT | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.04 | 0.00 | 1 | 0.20 |
| GUARANTOR | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.05 | 0.00 | 1 | 0.22 |
| REAL_ESTATE | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.28 | 1.00 | 1 | 0.45 |
| PROP_UNKN_NONE | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.15 | 0.00 | 1 | 0.36 |
| AGE | 1000 | 0 | 53 | 19 | 27.0 | 33.0 | 35.55 | 42.00 | 75 | 11.38 |
| OTHER_INSTALL | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.19 | 0.00 | 1 | 0.39 |
| RENT | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.18 | 0.00 | 1 | 0.38 |
| OWN_RES | 1000 | 0 | 2 | 0 | 0.0 | 1.0 | 0.71 | 1.00 | 1 | 0.45 |
| NUM_CREDITS | 1000 | 0 | 4 | 1 | 1.0 | 1.0 | 1.41 | 2.00 | 4 | 0.58 |
| NUM_DEPENDENTS | 1000 | 0 | 2 | 1 | 1.0 | 1.0 | 1.16 | 1.00 | 2 | 0.36 |
| TELEPHONE | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.40 | 1.00 | 1 | 0.49 |
| FOREIGN | 1000 | 0 | 2 | 0 | 0.0 | 0.0 | 0.04 | 0.00 | 1 | 0.19 |
We do not have any missing categorical variables. We see no variable has a level greater than 5.
credit.cat.summary <- df_cat_summary(df = credit)
# display in table
kable(credit.cat.summary, type = "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left",
latex_options = "scale_down")
| Count | Miss Pct. | Card. | Mode | Mode Freq. | Mode Pct. | 2nd Mode | 2nd Mode Freq. | 2nd Mode Pct. | |
|---|---|---|---|---|---|---|---|---|---|
| CHK_ACCT | 1000 | 0 | 4 | None | 394 | 39.4 | < 0 | 274 | 27.4 |
| HISTORY | 1000 | 0 | 5 | existing current | 530 | 53.0 | critical account | 293 | 29.3 |
| SAV_ACCT | 1000 | 0 | 5 | <100 | 603 | 60.3 | unknown/none | 183 | 18.3 |
| EMPLOYMENT | 1000 | 0 | 5 | 1 - <4 years | 339 | 33.9 | 7+ years | 253 | 25.3 |
| PRESENT_RESIDENT | 1000 | 0 | 4 | >3 years | 413 | 41.3 | >1 - 2 years | 308 | 30.8 |
| JOB | 1000 | 0 | 4 | medium | 630 | 63.0 | low | 200 | 20.0 |
| RESPONSE | 1000 | 0 | 2 | Good | 700 | 70.0 | Bad | 300 | 30.0 |
Before plotting plotting the distributions of our variables, we will partition our data into a train and test set. The training set will use 80% of our total records and we will use cross-validation to tune our models. The test set will be saved for later to help guide the recommendations for model selection and deployment.
# set random seed for reproducibility
set.seed(123)
# create 80/20 split of train and test data indices
trainIndex <- createDataPartition(credit$RESPONSE, p = .8,
list = FALSE,
times = 1)
# create partitions
credit.train <- credit[trainIndex, ]
credit.valid <- credit[-trainIndex, ]
# set theme
theme_set(theme_classic())
# add colorblind-friendly palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
After partitioning our data, we will just be using the training set to identify opportunities in developing our model.
Summary Statistics
credit.train.num.summary <- df_num_summary(df = credit.train)
# display in table
kable(credit.train.num.summary, type = "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left",
latex_options = "scale_down")
| Count | Miss Pct. | Card. | Min. | 1st Qrt. | Median | Mean | 3rd Qrt. | Max | Std.Dev. | |
|---|---|---|---|---|---|---|---|---|---|---|
| DURATION | 800 | 0 | 33 | 4 | 12 | 18.0 | 20.95 | 24.0 | 72 | 12.00 |
| NEW_CAR | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.22 | 0.0 | 1 | 0.42 |
| USED_CAR | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.11 | 0.0 | 1 | 0.31 |
| FURNITURE | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.18 | 0.0 | 1 | 0.38 |
| RADIO_TV | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.28 | 1.0 | 1 | 0.45 |
| EDUCATION | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.05 | 0.0 | 1 | 0.22 |
| RETRAINING | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.10 | 0.0 | 1 | 0.30 |
| AMOUNT | 800 | 0 | 752 | 250 | 1382 | 2350.5 | 3308.29 | 3968.5 | 18424 | 2845.80 |
| INSTALL_RATE | 800 | 0 | 4 | 1 | 2 | 3.0 | 2.98 | 4.0 | 4 | 1.12 |
| MALE_DIV | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.05 | 0.0 | 1 | 0.22 |
| MALE_SINGLE | 800 | 0 | 2 | 0 | 0 | 1.0 | 0.55 | 1.0 | 1 | 0.50 |
| MALE_MAR_or_WID | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.08 | 0.0 | 1 | 0.28 |
| CO_APPLICANT | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.04 | 0.0 | 1 | 0.19 |
| GUARANTOR | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.05 | 0.0 | 1 | 0.23 |
| REAL_ESTATE | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.28 | 1.0 | 1 | 0.45 |
| PROP_UNKN_NONE | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.16 | 0.0 | 1 | 0.37 |
| AGE | 800 | 0 | 53 | 19 | 27 | 33.0 | 35.79 | 42.0 | 75 | 11.51 |
| OTHER_INSTALL | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.19 | 0.0 | 1 | 0.39 |
| RENT | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.19 | 0.0 | 1 | 0.39 |
| OWN_RES | 800 | 0 | 2 | 0 | 0 | 1.0 | 0.70 | 1.0 | 1 | 0.46 |
| NUM_CREDITS | 800 | 0 | 4 | 1 | 1 | 1.0 | 1.41 | 2.0 | 4 | 0.59 |
| NUM_DEPENDENTS | 800 | 0 | 2 | 1 | 1 | 1.0 | 1.14 | 1.0 | 2 | 0.35 |
| TELEPHONE | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.40 | 1.0 | 1 | 0.49 |
| FOREIGN | 800 | 0 | 2 | 0 | 0 | 0.0 | 0.04 | 0.0 | 1 | 0.18 |
credit.train.cat.summary <- df_cat_summary(df = credit.train)
# display in table
kable(credit.train.cat.summary, type = "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left",
latex_options = "scale_down")
| Count | Miss Pct. | Card. | Mode | Mode Freq. | Mode Pct. | 2nd Mode | 2nd Mode Freq. | 2nd Mode Pct. | |
|---|---|---|---|---|---|---|---|---|---|
| CHK_ACCT | 800 | 0 | 4 | None | 314 | 39.2 | < 0 | 223 | 27.9 |
| HISTORY | 800 | 0 | 5 | existing current | 420 | 52.5 | critical account | 237 | 29.6 |
| SAV_ACCT | 800 | 0 | 5 | <100 | 483 | 60.4 | unknown/none | 147 | 18.4 |
| EMPLOYMENT | 800 | 0 | 5 | 1 - <4 years | 269 | 33.6 | 7+ years | 207 | 25.9 |
| PRESENT_RESIDENT | 800 | 0 | 4 | >3 years | 337 | 42.1 | >1 - 2 years | 236 | 29.5 |
| JOB | 800 | 0 | 4 | medium | 508 | 63.5 | low | 152 | 19.0 |
| RESPONSE | 800 | 0 | 2 | Good | 560 | 70.0 | Bad | 240 | 30.0 |
Plots
Continuous Features
The DURATION distributions confirm a right-skewed distribution. Additionally, it appears Bad credit is associated with longer credit durations.
# duration
ggplot(credit.train, aes(x = DURATION)) +
geom_histogram(aes(y =..density..), color = "black", fill = "grey", binwidth = 1) +
geom_density(fill= "grey", alpha = 0.4)
ggplot(credit.train, aes(x = DURATION, fill = RESPONSE)) +
#geom_histogram(aes(y =..density..), color = "black", position = "identity", binwidth = 1, alpha = 0.5) +
geom_density(alpha = 0.4) +
scale_fill_manual(values = c(cbPalette[7], cbPalette[4]))
ggplot(credit.train, aes(x = RESPONSE, y = DURATION, fill = RESPONSE)) +
geom_boxplot() +
coord_flip() +
scale_fill_manual(values = c(cbPalette[7], cbPalette[4]))
The AMOUNT variable is also right-skewed with Bad credit showing slightly larger credit amounts.
# amount
ggplot(credit.train, aes(x = AMOUNT)) +
geom_histogram(aes(y =..density..), color = "black", fill = "grey", binwidth = 100) +
geom_density(fill= "grey", alpha = 0.4)
ggplot(credit.train, aes(x = AMOUNT, fill = RESPONSE)) +
#geom_histogram(aes(y =..density..), color = "black", position = "identity", binwidth = 250, alpha = 0.5) +
geom_density(alpha = 0.4) +
scale_fill_manual(values = c(cbPalette[7], cbPalette[4]))
ggplot(credit.train, aes(x = RESPONSE, y = AMOUNT, fill = RESPONSE)) +
geom_boxplot() +
coord_flip() +
scale_fill_manual(values = c(cbPalette[7], cbPalette[4]))
The age of applicants tends to be lower for Bad credit labels compared to Good credit labels.
# age
ggplot(credit.train, aes(x = AGE)) +
geom_histogram(aes(y =..density..), color = "black", fill = "grey", binwidth = 2) +
geom_density(color = "grey", alpha = 0.4)
ggplot(credit.train, aes(x = AGE, fill = RESPONSE)) +
#geom_histogram(aes(y =..density..), color = "black", position = "identity", binwidth = 2, alpha = 0.5) +
geom_density(alpha = 0.4) +
scale_fill_manual(values = c(cbPalette[7], cbPalette[4]))
ggplot(credit.train, aes(x = RESPONSE, y = AGE, fill = RESPONSE)) +
geom_boxplot() +
coord_flip() +
scale_fill_manual(values = c(cbPalette[7], cbPalette[4]))
Binary Features
In plotting binary features by our target, NEW_CAR, USED_CAR, and RENT display interesting variability in seperating the target feature.
bin.vars <- c("NEW_CAR", "USED_CAR", "FURNITURE", "RADIO_TV", "EDUCATION", "RETRAINING",
"MALE_DIV", "MALE_SINGLE", "MALE_MAR_or_WID",
"CO_APPLICANT", "GUARANTOR", "REAL_ESTATE", "PROP_UNKN_NONE", "OTHER_INSTALL",
"RENT", "OWN_RES", "TELEPHONE", "FOREIGN")
for (var in bin.vars) {
form <- as.formula(paste("~ ", var, "+ RESPONSE"))
mosaicplot(form, data = credit.train,
col = c(cbPalette[7], cbPalette[4]),
cex.axis = 0.45,
xlab = var,
main = "")
}
Categorical Features
In reviewing the distributions of our categorical variables compared to the target variable, CHK_ACCT stands out as a potentially important variable.
cat.vars <- c("CHK_ACCT", "HISTORY", "SAV_ACCT", "EMPLOYMENT", "INSTALL_RATE",
"PRESENT_RESIDENT", "NUM_CREDITS", "JOB", "NUM_DEPENDENTS")
for (var in cat.vars) {
form <- as.formula(paste("~ ", var, "+ RESPONSE"))
mosaicplot(form, data = credit.train,
col = c(cbPalette[7], cbPalette[4]),
cex.axis = 0.45,
xlab = var,
main = "")
}
Multivariate Analysis
In reviewing correlations of our numeric variables, we see a strong negative correlation between OWN_RES and RENT, which is unsurprising. DURATION and AMOUNT also show a positive correlation.
drop.vars <- c("CHK_ACCT", "HISTORY", "SAV_ACCT", "EMPLOYMENT", "INSTALL_RATE",
"PRESENT_RESIDENT", "NUM_CREDITS", "JOB", "NUM_DEPENDENTS")
# isolate variables for correlation
credit.corr <- credit.train %>%
select(-one_of(drop.vars)) %>%
mutate(RESPONSE = ifelse(RESPONSE == "Good", 1, 0))
# create correlation matrix
cor.mat <- round(cor(credit.corr), 2)
melted.cor.mat <- melt(cor.mat)
# plot correlation heatmap
ggplot(melted.cor.mat, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "red", high = "blue", mid = "white",
name = "Pearson Correlation") +
labs(title = "German Credit Correlation Heatmap") +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 1.5) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.x = element_blank(), axis.title.y = element_blank())
When comparing the mean values of DURATION, AMOUNT, and AGE across Bad and Good credit, we see statistically significant differences. These will most likely be important variables to separate our classes.
credit.mean <- credit.train %>%
select(DURATION, AMOUNT, AGE, RESPONSE)
categories <- colnames(credit.mean)
credit.ttest <- data.frame(Category = categories[1:3],
p_value = rep(0,3))
# loop to run through each variable for ttest
for (i in 1:nrow(credit.ttest)) {
var <- categories[i]
p <- t.test(get(var) ~ RESPONSE, data = credit.mean)$p.value
credit.ttest[i, 2] <- round(p, 4)
}
# display in table
kable(credit.ttest, type = "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left",
latex_options = "scale_down")
| Category | p_value |
|---|---|
| DURATION | 0.0000 |
| AMOUNT | 0.0004 |
| AGE | 0.0011 |
Testing the proportions of our binary variables shows some potentially significant differences for variables OWN_RES, PROP_UNKN_NONE, OTHER_INSTALL, and some others. We are running multiple tests, so there are likely false positives in our test results.
credit.prop <- credit.train %>%
select(-DURATION, -AMOUNT, -INSTALL_RATE, -AGE, -NUM_CREDITS,
-NUM_DEPENDENTS, -CHK_ACCT, -HISTORY, -SAV_ACCT, -EMPLOYMENT,
-PRESENT_RESIDENT, -JOB)
categories <- colnames(credit.prop)
credit.prop.test <- data.frame(Category = categories[1:18],
p_value = rep(0,18))
# loop to run through each variable for ttest
for (i in 1:nrow(credit.prop.test)) {
var <- categories[i]
dat <- credit.prop %>% select(RESPONSE, var)
test.table <- table(dat)
test.table <- test.table[ , c(2, 1)]
p <- prop.test(test.table)$p.value
credit.prop.test[i, 2] <- round(p, 4)
}
# display in table
credit.prop.test %>%
arrange(p_value) %>%
kable(type = "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left",
latex_options = "scale_down")
| Category | p_value |
|---|---|
| OWN_RES | 0.0001 |
| PROP_UNKN_NONE | 0.0009 |
| OTHER_INSTALL | 0.0022 |
| RADIO_TV | 0.0026 |
| USED_CAR | 0.0027 |
| REAL_ESTATE | 0.0032 |
| RENT | 0.0042 |
| NEW_CAR | 0.0126 |
| EDUCATION | 0.0515 |
| MALE_SINGLE | 0.0543 |
| CO_APPLICANT | 0.0676 |
| FOREIGN | 0.1016 |
| GUARANTOR | 0.1323 |
| TELEPHONE | 0.1394 |
| RETRAINING | 0.2472 |
| FURNITURE | 0.4685 |
| MALE_DIV | 0.5954 |
| MALE_MAR_or_WID | 1.0000 |
Before moving forward with prototyping various models, we want to first remove predictors that have near zero variance in the training sample. These are variables that are likely to be uninformative for our modeling. We see two variables have been removed due to low variance.
# remove variables with low variance
nzv <- nearZeroVar(credit.train)
credit.train <- credit.train[ , -nzv]
credit.valid <- credit.valid[ , -nzv]
print(nzv)
## [1] 17 30
The first model we build is a logistic regression using all of the remaining variables. Using a 10-fold repeated cross-validation, we produce an accuracy of about 75%. This is an improvement above the baseline of 70%, which is the proportion of Good samples in our dataset. We use the Box-Cox transformation to handle any skewness of our variables.
# add controls for training model
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
classProbs = TRUE
#summaryFunction = twoClassSummary
)
# set seed to compare models
set.seed(123)
glm.all <- train(RESPONSE ~ ., data = credit.train,
method = "glm",
#metric = "ROC",
preProc = c("BoxCox"),
trControl = ctrl)
glm.all
## Generalized Linear Model
##
## 800 samples
## 28 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: Box-Cox transformation (5)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7565 0.3900469
Using our initial logistic model, we use backwards stepwise selection to preduce our predictors. This produces a mild improvement in our accuracy while reducing our model to 18 predictors.
glm.all.step <- glm(RESPONSE ~ ., data = credit.train, family = "binomial")
glm.step.back <- step(glm.all.step, direction = "backward", trace = 0)
# set seed to compare models
set.seed(123)
glm.back <- train(glm.step.back$formula, data = credit.train,
method = "glm",
#metric = "ROC",
preProc = c("BoxCox"),
trControl = ctrl)
glm.back
## Generalized Linear Model
##
## 800 samples
## 18 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: Box-Cox transformation (5)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Resampling results:
##
## Accuracy Kappa
## 0.76375 0.4050818
Next, we use forward stepwise selection on our logisitc model. This method reduces our model to 17 predictors without much of a difference in performance from our full and backwards logistic models.
min.model <- glm(RESPONSE ~ 1, data = credit.train, family = "binomial")
biggest <- formula(glm( RESPONSE ~ ., data = credit.train, family = "binomial"))
glm.step.for <- step(min.model, direction = "forward", scope = biggest, trace = 0)
# set seed to compare models
set.seed(123)
glm.for <- train(glm.step.for$formula, data = credit.train,
method = "glm",
#metric = "ROC",
preProc = c("BoxCox"),
trControl = ctrl)
glm.for
## Generalized Linear Model
##
## 800 samples
## 17 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: Box-Cox transformation (5)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Resampling results:
##
## Accuracy Kappa
## 0.75975 0.3925573
Next, we increase explore using classification trees with a random search over the parameter grid. In testing 10 different models, we max our accuracy out at around 72%.
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
classProbs = TRUE,
#summaryFunction = twoClassSummary,
search = "random"
)
# set seed to compare models
set.seed(123)
ct <- train(RESPONSE ~ ., data = credit.train,
method = "rpart",
#metric = "ROC",
trControl = ctrl,
tuneLength = 10
)
ct
## CART
##
## 800 samples
## 28 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.000000000 0.70350 0.26566236
## 0.005555556 0.71350 0.26938587
## 0.006944444 0.72075 0.28005186
## 0.010000000 0.72200 0.26630999
## 0.020833333 0.71325 0.23582428
## 0.027083333 0.70725 0.21055744
## 0.045833333 0.70025 0.09415541
## 0.047916667 0.69500 0.05228968
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01.
Using a slightly different classification tree method, we max our accuracy out at 72% with a tree depth of 11.
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
classProbs = TRUE,
#summaryFunction = twoClassSummary,
search = "random"
)
# set seed to compare models
set.seed(123)
ct2 <- train(RESPONSE ~ ., data = credit.train,
method = "rpart2",
#metric = "ROC",
trControl = ctrl,
tuneLength = 10)
ct2
## CART
##
## 800 samples
## 28 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Resampling results across tuning parameters:
##
## maxdepth Accuracy Kappa
## 2 0.71125 0.1909295
## 3 0.71175 0.2201995
## 4 0.71500 0.2355226
## 8 0.72050 0.2598728
## 11 0.72275 0.2654983
## 17 0.72200 0.2663100
## 20 0.72200 0.2663100
## 29 0.72200 0.2663100
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 11.
Next we explore using k-nearest neighbors model across several values of k. We max our accuracy out at 73% using the nearest 47 neighbors.
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
classProbs = TRUE,
#summaryFunction = twoClassSummary,
search = "random"
)
grid = expand.grid(k = seq(3, 51, 2))
# set seed to compare models
set.seed(123)
knn <- train(RESPONSE ~ ., data = credit.train,
method = "knn",
#metric = "ROC",
trControl = ctrl,
preProc = c("BoxCox", "center", "scale"),
tuneGrid = grid
)
knn
## k-Nearest Neighbors
##
## 800 samples
## 28 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: Box-Cox transformation (5), centered (43), scaled (43)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 3 0.70925 0.2598501
## 5 0.70775 0.2316811
## 7 0.70650 0.2240451
## 9 0.70550 0.2050742
## 11 0.70925 0.2059110
## 13 0.70700 0.1885794
## 15 0.70400 0.1655101
## 17 0.70850 0.1754343
## 19 0.70725 0.1645962
## 21 0.70800 0.1599056
## 23 0.71050 0.1652140
## 25 0.71475 0.1737182
## 27 0.71100 0.1513471
## 29 0.71300 0.1516990
## 31 0.71900 0.1643922
## 33 0.72275 0.1680119
## 35 0.71925 0.1535694
## 37 0.71975 0.1507750
## 39 0.72150 0.1519236
## 41 0.72450 0.1575674
## 43 0.72525 0.1575965
## 45 0.72725 0.1599940
## 47 0.72800 0.1582270
## 49 0.72500 0.1469340
## 51 0.72425 0.1406629
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 47.
Next, we move to more complex models to see if we can improve on our initial logistic model with more “black box” methods. A random search gets us closer with an accuracy of 75%.
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
classProbs = TRUE,
#summaryFunction = twoClassSummary,
search = "random"
)
# set seed to compare models
set.seed(123)
nn <- train(RESPONSE ~ ., data = credit.train,
method = "nnet",
preProc = c("BoxCox", "center", "scale"),
#metric = "ROC",
trControl = ctrl,
tuneLength = 10,
trace = FALSE)
nn
## Neural Network
##
## 800 samples
## 28 predictor
## 2 classes: 'Bad', 'Good'
##
## Pre-processing: Box-Cox transformation (5), centered (43), scaled (43)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 2.505820e+00 0.74775 0.3506237
## 9 1.162583e-01 0.70950 0.3024253
## 10 5.333618e+00 0.74450 0.3143395
## 11 2.995894e-04 0.69900 0.2849435
## 12 9.279494e-04 0.69875 0.2770412
## 16 5.248134e-03 0.71125 0.2961710
## 18 1.787958e-05 0.69625 0.2590369
## 18 2.727724e-02 0.71900 0.3127534
## 19 4.145225e-05 0.69950 0.2733467
## 20 2.173882e+00 0.75625 0.3794894
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 20 and decay = 2.173882.
Across a random search of random forest parameters, we produce an accuracy of 74%.
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
classProbs = TRUE,
#summaryFunction = twoClassSummary,
search = "random"
)
# set seed to compare models
set.seed(123)
rf <- train(RESPONSE ~ ., data = credit.train,
method = "rf",
#metric = "ROC",
trControl = ctrl,
tuneLength = 10)
rf
## Random Forest
##
## 800 samples
## 28 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.72725 0.1558377
## 18 0.74425 0.3225642
## 20 0.74425 0.3230032
## 23 0.74125 0.3190801
## 24 0.74150 0.3207972
## 34 0.73875 0.3213285
## 38 0.73725 0.3186434
## 39 0.73825 0.3218119
## 41 0.73625 0.3210190
## 42 0.73600 0.3160522
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 18.
Lastly, we try a boosting model to see if we can increase our accuracy further. A random grid search across 10 models provides us with an accuracy of around 75%.
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
classProbs = TRUE,
#summaryFunction = twoClassSummary,
search = "random"
)
# set seed to compare models
set.seed(123)
xgb <- train(RESPONSE ~ ., data = credit.train,
method = "xgbTree",
#metric = "ROC",
trControl = ctrl,
tuneLength = 10)
xgb
## eXtreme Gradient Boosting
##
## 800 samples
## 28 predictor
## 2 classes: 'Bad', 'Good'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Resampling results across tuning parameters:
##
## eta max_depth gamma colsample_bytree min_child_weight
## 0.08912107 10 2.3162579 0.6431311 7
## 0.17420668 4 3.1818101 0.4063891 18
## 0.32689555 3 7.5845954 0.3932136 2
## 0.35689107 1 2.1640794 0.4863850 15
## 0.38466358 7 6.9070528 0.4654897 16
## 0.39376777 2 0.2461368 0.3609779 11
## 0.41598924 5 9.0229905 0.4658185 9
## 0.42540975 9 4.7779597 0.3555224 4
## 0.57785152 9 1.4280002 0.3183325 13
## 0.59656760 6 7.9546742 0.4475382 2
## subsample nrounds Accuracy Kappa
## 0.5798738 457 0.73750 0.3305658
## 0.8457567 552 0.73650 0.2995724
## 0.8575483 529 0.74425 0.3080856
## 0.8592921 893 0.73025 0.2848044
## 0.5379772 409 0.73725 0.2923941
## 0.8609800 941 0.70975 0.2725137
## 0.3211305 789 0.73200 0.2816812
## 0.5863873 46 0.74725 0.3399415
## 0.8158564 957 0.72575 0.3003962
## 0.4557877 884 0.73225 0.3266194
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 46, max_depth = 9,
## eta = 0.4254098, gamma = 4.77796, colsample_bytree =
## 0.3555224, min_child_weight = 4 and subsample = 0.5863873.
Since we resample the same folds for evaluating our models, we can compare the accuracies of the different tests. Our logistic models show the highest median accuracies. This is benefical as the logistic model will be more transparent in terms of variable importance.
resamp <- resamples(list(Logistic_all = glm.all,
Logistic_back = glm.back,
Logistic_for = glm.for,
CART = ct,
CART_2 = ct2,
K_NN = knn,
Neural_Net = nn,
Random_Forest = rf,
XGBoost = xgb))
summary(resamp)
##
## Call:
## summary.resamples(object = resamp)
##
## Models: Logistic_all, Logistic_back, Logistic_for, CART, CART_2, K_NN, Neural_Net, Random_Forest, XGBoost
## Number of resamples: 50
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Logistic_all 0.6500 0.725000 0.76250 0.75650 0.796875 0.8500 0
## Logistic_back 0.6625 0.740625 0.76250 0.76375 0.787500 0.8625 0
## Logistic_for 0.6500 0.725000 0.75625 0.75975 0.787500 0.8625 0
## CART 0.6000 0.700000 0.72500 0.72200 0.750000 0.8000 0
## CART_2 0.6000 0.700000 0.73125 0.72275 0.750000 0.8125 0
## K_NN 0.6625 0.712500 0.72500 0.72800 0.737500 0.8000 0
## Neural_Net 0.6500 0.725000 0.76250 0.75625 0.787500 0.8500 0
## Random_Forest 0.5875 0.715625 0.75000 0.74425 0.775000 0.8250 0
## XGBoost 0.6375 0.725000 0.73750 0.74725 0.775000 0.8250 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu.
## Logistic_all 0.14473684 0.31250000 0.4017480 0.3900469 0.4869600
## Logistic_back 0.14473684 0.35363248 0.4135802 0.4050818 0.4938272
## Logistic_for 0.16666667 0.29927885 0.3944890 0.3925573 0.4620253
## CART -0.14285714 0.18942710 0.2644280 0.2663100 0.3421053
## CART_2 -0.14285714 0.18942710 0.2792081 0.2654983 0.3411378
## K_NN -0.03846154 0.08730159 0.1666667 0.1582270 0.2164179
## Neural_Net 0.10256410 0.31250000 0.3953130 0.3794894 0.4719878
## Random_Forest -0.13013699 0.23883032 0.3423791 0.3225642 0.4192814
## XGBoost 0.08227848 0.26159901 0.3323557 0.3399415 0.4078947
## Max. NA's
## Logistic_all 0.6153846 0
## Logistic_back 0.6428571 0
## Logistic_for 0.6428571 0
## CART 0.4871795 0
## CART_2 0.5238095 0
## K_NN 0.4117647 0
## Neural_Net 0.6052632 0
## Random_Forest 0.5512821 0
## XGBoost 0.5625000 0
model.differences <- diff(resamp)
summary(model.differences)
##
## Call:
## summary.diff.resamples(object = model.differences)
##
## p-value adjustment: bonferroni
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
##
## Accuracy
## Logistic_all Logistic_back Logistic_for CART CART_2
## Logistic_all -0.00725 -0.00325 0.03450 0.03375
## Logistic_back 1.0000000 0.00400 0.04175 0.04100
## Logistic_for 1.0000000 1.0000000 0.03775 0.03700
## CART 0.0001308 1.238e-06 1.576e-05 -0.00075
## CART_2 0.0011054 2.398e-05 0.0001636 1.0000000
## K_NN 0.0007890 8.008e-06 0.0001724 1.0000000 1.0000000
## Neural_Net 1.0000000 1.0000000 1.0000000 0.0001622 0.0005941
## Random_Forest 1.0000000 0.1152607 0.9091733 0.0013742 0.0033661
## XGBoost 1.0000000 0.5080848 1.0000000 0.0070903 0.0088843
## K_NN Neural_Net Random_Forest XGBoost
## Logistic_all 0.02850 0.00025 0.01225 0.00925
## Logistic_back 0.03575 0.00750 0.01950 0.01650
## Logistic_for 0.03175 0.00350 0.01550 0.01250
## CART -0.00600 -0.03425 -0.02225 -0.02525
## CART_2 -0.00525 -0.03350 -0.02150 -0.02450
## K_NN -0.02825 -0.01625 -0.01925
## Neural_Net 0.0008529 0.01200 0.00900
## Random_Forest 0.4047475 1.0000000 -0.00300
## XGBoost 0.0667534 1.0000000 1.0000000
##
## Kappa
## Logistic_all Logistic_back Logistic_for CART
## Logistic_all -0.0150350 -0.0025104 0.1237369
## Logistic_back 1.0000000 0.0125246 0.1387719
## Logistic_for 1.0000000 1.0000000 0.1262473
## CART 1.072e-06 5.652e-08 6.251e-07
## CART_2 3.853e-06 2.966e-07 2.073e-06 1.0000000
## K_NN < 2.2e-16 < 2.2e-16 2.370e-16 0.0001967
## Neural_Net 1.0000000 0.6525794 1.0000000 1.183e-05
## Random_Forest 0.0120946 0.0004971 0.0126930 0.0022326
## XGBoost 0.2763252 0.0119210 0.1503092 0.0050315
## CART_2 K_NN Neural_Net Random_Forest XGBoost
## Logistic_all 0.1245486 0.2318199 0.0105575 0.0674827 0.0501054
## Logistic_back 0.1395835 0.2468549 0.0255925 0.0825176 0.0651403
## Logistic_for 0.1270589 0.2343303 0.0130679 0.0699931 0.0526157
## CART 0.0008117 0.1080830 -0.1131794 -0.0562542 -0.0736315
## CART_2 0.1072713 -0.1139910 -0.0570659 -0.0744432
## K_NN 0.0002971 -0.2212624 -0.1643372 -0.1817145
## Neural_Net 1.690e-05 < 2.2e-16 0.0569252 0.0395478
## Random_Forest 0.0035424 5.364e-10 0.0991966 -0.0173773
## XGBoost 0.0025545 9.205e-13 1.0000000 1.0000000
bwplot(resamp, main = "Model Metric Comparison")
dotplot(resamp, metric = "Accuracy", main = "Model Accuracy Confidence Intervals")
We can also review the variable importance of each of our models. We see CHK_ACCT, SAV_ACCT, AGE, and DURATION are common variables in the top 10 for each model.
log_back_imp <- varImp(glm.back)
log_all_imp <- varImp(glm.all)
xgb_imp <- varImp(xgb)
nn_imp <- varImp(nn)
rf_imp <- varImp(rf)
plot(log_back_imp, top = 10, main = "Backward Stepwise Logistic Regression")
plot(log_all_imp, top = 10, main = "Full Logisitc Regression")
plot(xgb_imp, top = 10, main = "XgBoost")
plot(nn_imp, top = 10, main = "Neural Net")
plot(rf_imp, top = 10, main = "Random Forest")
Our initial tests show the logistic models perform well in our dataset compared to more complex models. We will use the test set we partitioned to evaluate the expected profits of the backwards logistic regression, gradient boosted, and random forest models. We will refer back to the profit for loans approved to select the best model and make recommendations for reviewing applications.
As a refresher, for applications that are approved, a Good loan returns +100 in profit, and a Bad loan costs -500 in profit.
To evaluate our models, we will view the ROC scores of our training and test sets to see how well the classes separate. Then we will determine the optimal cutoff points for probabilities to maximixe separation.
# function to plot ROC curves from ROCR objects
plot_roc <- function(train_roc, train_auc, test_roc, test_auc) {
plot(train_roc, col = "blue", lty = "solid", main = "", lwd = 2,
xlab = "False Positive Rate",
ylab = "True Positive Rate")
plot(test_roc, col = "red", lty = "dashed", lwd = 2, add = TRUE)
abline(c(0,1))
# draw legend
train.legend <- paste("Training AUC = ", round(train_auc, digits = 3))
test.legend <- paste("Test AUC = ", round(test_auc, digits = 3))
legend("bottomright", legend = c(train.legend, test.legend),
lty = c("solid", "dashed"), lwd = 2, col = c("blue", "red"))
}
# function to determine optimal cutoff point
opt.cut <- function(perf, pred){
cut.ind <- mapply(FUN = function(x, y, p) {
d <- (x - 0) ^ 2 + (y - 1) ^ 2
ind <- which(d == min(d))
c(sensitivity = y[[ind]], specificity = 1 - x[[ind]],
cutoff = p[[ind]])
}, perf@x.values, perf@y.values, pred@cutoffs)
}
Our logistic regression model produces a decent ROC score with an optimal cutoff probability of 0.67.
# training metrics for backward step logistic regression
credit.train$log_back_prob <- predict.train(glm.back,
newdata = credit.train,
type = "prob")[ ,2]
credit.train.log_back_pred <- prediction(credit.train$log_back_prob, credit.train$RESPONSE)
credit.train.log_back.auc <- as.numeric(performance(credit.train.log_back_pred, "auc")@y.values)
credit.train.roc <- performance(credit.train.log_back_pred, "tpr", "fpr")
# test accuracy for backward step logistic regression
credit.valid$log_back_prob <- predict.train(glm.back,
newdata = credit.valid,
type = "prob")[ ,2]
credit.valid.log_back_pred <- prediction(credit.valid$log_back_prob, credit.valid$RESPONSE)
credit.valid.log_back.auc <- as.numeric(performance(credit.valid.log_back_pred, "auc")@y.values)
credit.valid.roc <- performance(credit.valid.log_back_pred, "tpr", "fpr")
# plot ROC/AUC scores
plot_roc(train_roc = credit.train.roc,
train_auc = credit.train.log_back.auc,
test_roc = credit.valid.roc,
test_auc = credit.valid.log_back.auc)
print(opt.cut(credit.valid.roc, credit.valid.log_back_pred))
## [,1]
## sensitivity 0.8000000
## specificity 0.7000000
## cutoff 0.6664984
The gradient boosted model drops slightly in performance across the test set compared to the logistic model. This is similar to what we saw in the model building process.
# training metrics for xgboost model
credit.train$xgb_prob <- predict.train(xgb,
newdata = credit.train,
type = "prob")[ ,2]
credit.train.xgb_pred <- prediction(credit.train$xgb_prob, credit.train$RESPONSE)
credit.train.xgb.auc <- as.numeric(performance(credit.train.xgb_pred, "auc")@y.values)
credit.train.roc <- performance(credit.train.xgb_pred, "tpr", "fpr")
# test accuracy for xgboost model
credit.valid$xgb_prob <- predict.train(xgb,
newdata = credit.valid,
type = "prob")[ ,2]
credit.valid.xgb_pred <- prediction(credit.valid$xgb_prob, credit.valid$RESPONSE)
credit.valid.xgb.auc <- as.numeric(performance(credit.valid.xgb_pred, "auc")@y.values)
credit.valid.roc <- performance(credit.valid.xgb_pred, "tpr", "fpr")
# plot ROC/AUC scores
plot_roc(train_roc = credit.train.roc,
train_auc = credit.train.xgb.auc,
test_roc = credit.valid.roc,
test_auc = credit.valid.xgb.auc)
print(opt.cut(credit.valid.roc, credit.valid.xgb_pred))
## [,1]
## sensitivity 0.7214286
## specificity 0.7166667
## cutoff 0.6946105
For the random forest, we see it overfits the training set and performs much worse on the test set. Again, the logistic regression still shows the best ROC score.
credit.valid$rf_prob <- predict.train(rf,
newdata = credit.valid,
type = "prob")[ ,2]
# training metrics for xgboost model
credit.train$rf_prob <- predict.train(rf,
newdata = credit.train,
type = "prob")[ ,2]
credit.train.rf_pred <- prediction(credit.train$rf_prob, credit.train$RESPONSE)
credit.train.rf.auc <- as.numeric(performance(credit.train.rf_pred, "auc")@y.values)
credit.train.roc <- performance(credit.train.rf_pred, "tpr", "fpr")
# test accuracy for xgboost model
credit.valid$rf_prob <- predict.train(rf,
newdata = credit.valid,
type = "prob")[ ,2]
credit.valid.rf_pred <- prediction(credit.valid$rf_prob, credit.valid$RESPONSE)
credit.valid.rf.auc <- as.numeric(performance(credit.valid.rf_pred, "auc")@y.values)
credit.valid.roc <- performance(credit.valid.rf_pred, "tpr", "fpr")
# plot ROC/AUC scores
plot_roc(train_roc = credit.train.roc,
train_auc = credit.train.rf.auc,
test_roc = credit.valid.roc,
test_auc = credit.valid.rf.auc)
print(opt.cut(credit.valid.roc, credit.valid.rf_pred))
## [,1]
## sensitivity 0.7642857
## specificity 0.6666667
## cutoff 0.6140000
To confirm our evaluation and guide our recommendations, we can calculate the expected profit of our predictions across each decile of the test set. Looking at the individual profit and cumulative profit provides us insight into where our model produces the most value as well as when we may want human-intervention to take over for the model.
The gradient boosted model actually peaks out at the highest expected profit through 40% of the predictions ranked by probability. The random forest model performs the best through the first 20% of the test set. Across the entire test sample, the logistic regression provides the highest overall expected profit. Each model performs similar through the final 30% of the test sample.
# use optimal estimates for each models predictions
credit.valid$log_back_pred <- factor(ifelse(credit.valid$log_back_prob > 0.67, "Good", "Bad"))
credit.valid$xgb_pred <- factor(ifelse(credit.valid$xgb_prob > 0.69, "Good", "Bad"))
credit.valid$rf_pred <- factor(ifelse(credit.valid$rf_prob > 0.61, "Good", "Bad"))
# use predictions to assess expected profit
credit.valid$log_back_profit <- ifelse(credit.valid$log_back_pred == "Good" & credit.valid$RESPONSE == "Good", 100,
ifelse(credit.valid$log_back_pred == "Good" & credit.valid$RESPONSE == "Bad", -500, 0))
credit.valid$xgb_profit <- ifelse(credit.valid$xgb_pred == "Good" & credit.valid$RESPONSE == "Good", 100,
ifelse(credit.valid$xgb_pred == "Good" & credit.valid$RESPONSE == "Bad", -500, 0))
credit.valid$rf_profit <- ifelse(credit.valid$rf_pred == "Good" & credit.valid$RESPONSE == "Good", 100,
ifelse(credit.valid$rf_pred == "Good" & credit.valid$RESPONSE == "Bad", -500, 0))
# calculate cumulative profit for each model to assess optimal profit for triage cutoff
logistic.profit <- credit.valid %>%
select(log_back_prob, log_back_profit) %>%
arrange(desc(log_back_prob)) %>%
mutate(model = "Logistic",
case = row_number(),
profit = log_back_profit,
decile = cut(1:n(), breaks = quantile(1:n(), probs = seq(0, 1, .1)),
include.lowest = TRUE,
labels = c(10:1))) %>%
select(model, case, profit, decile)
xgb.profit <- credit.valid %>%
select(xgb_prob, xgb_profit) %>%
arrange(desc(xgb_prob)) %>%
mutate(model = "XgBoost",
case = row_number(),
profit = xgb_profit,
decile = cut(1:n(), breaks = quantile(1:n(), probs = seq(0, 1, .1)),
include.lowest = TRUE,
labels = c(10:1))) %>%
select(model, case, profit, decile)
rf.profit <- credit.valid %>%
select(rf_prob, rf_profit) %>%
arrange(desc(rf_prob)) %>%
mutate(model = "Random Forest",
case = row_number(),
profit = rf_profit,
decile = cut(1:n(), breaks = quantile(1:n(), probs = seq(0, 1, .1)),
include.lowest = TRUE,
labels = c(10:1))) %>%
select(model, case, profit, decile)
# combine expected profit dataframes for plotting
profit <- rbind(logistic.profit, xgb.profit, rf.profit) %>%
group_by(model, decile) %>%
summarise(dec_profit = sum(profit)) %>%
mutate(profit = cumsum(dec_profit))
# plot expected profit by decile
ggplot(profit, aes(x = decile, y = dec_profit, group = model, color = model)) +
geom_line(aes(linetype = model)) +
geom_point(aes(shape = model)) +
geom_hline(yintercept = 1, linetype = 2, size = 0.1) +
labs(title = "Decile Expected Profit - Test Set", x = "Decile", y = "Profit")
# plot cumulative expected profit by decile
ggplot(profit, aes(x = decile, y = profit, group = model, color = model)) +
geom_line(aes(linetype = model)) +
geom_point(aes(shape = model)) +
geom_hline(yintercept = 1, linetype = 2, size = 0.1) +
labs(title = "Cumulative Expected Profit - Test Set", x = "Decile", y = "Profit")
If we view the predicted probabilities of Good credit for the logisitc model, we can determine the best way to triage the model’s predictions in deployment. We see that our profit is maximized through the top 40% of predicted probabilities for the logisitc regression model. Additionally, our profit remains flat through the last 40%. Using the distribution of the predicted probabilities, we surmise that the model can be used to auto-accept applications with a predicted probability above 80-85% (82% based on our sample) and auto-reject applications with a predicted probability lower than 70-75% (71% for our sample). We can then continue to manually review applications that fall within the two cutoff points. This method will reduce cost to review applications, speed up the process of notifying customers of their status, and allow us to continue to get samples to refine and improve the model.
sort(quantile(credit.valid$log_back_prob, probs = seq(0, 1, .1)), decreasing = TRUE)
## 100% 90% 80% 70% 60% 50%
## 0.99336040 0.96432018 0.93439597 0.87819237 0.82405482 0.77761621
## 40% 30% 20% 10% 0%
## 0.71493049 0.58946064 0.42023900 0.32647786 0.08492754